This report examines a dataset relating to red variants of a Portuguese wine. Although I’m a big fan of red wine, I know very little about what constitutes a high-quality wine, and I’m eager to learn more.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
This dataset consists of 1,599 observations of 13 variables: a row index (X), 11 chemical measurements, and a quality rating.
After looking at the summary data and this first histogram of the distribution of quality, one of the first things that I saw is that the majority of wines fall in the middle of the quality scale. What chemical characteristics do wines share on the higher end of the scale (7 or 8)?
Next I wanted to get a bit more familiar with the other variables. I did some research to learn a little more about some of the chemical properties listed so that I would know what I was looking at. I divided the variables into smaller groups, though I don’t mean to imply any specific correlations between the variables in the same groupings.
Acids
Then I adjusted the plots a little to zoom in on the data.
Next I looked at residual sugar and chlorides:
Two very long tails here!
Let’s zoom in.
When reading up on the significance of residual sugar in wine, I learned that the amount of residual sugar determines how “dry” or “sweet” a wine is considered to be. Based on levels I found on a chart online that characterizes wines based on their residual sugar levels, I created a new variable, “sweetness”, which displays whether a wine is characterized as “Dry”, “Off Dry”, “Medium Dry”, “Medium Sweet”, “Sweet” or “Luscious”. Surprisingly, the vast majority of the wines in the dataset are “Dry” wines (95%), and the rest are “Off Dry”. There were no other categories represented under sweetness.
## # A tibble: 2 x 2
## sweetness n
## <fctr> <int>
## 1 Dry 1515
## 2 Off Dry 84
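A minimal sketch of how a sweetness label like this could be derived with base R's `cut()`. The cut points and their units are hypothetical placeholders, since the chart mentioned above is not reproduced in this report, and the sample values are the first ten residual sugar readings from the `str()` output.

```r
# Hypothetical cut points (assumed to be in g/L). These are illustrative
# assumptions, NOT the thresholds from the chart used in the report.
sweetness_breaks <- c(0, 4, 12, 20, 45, 120, Inf)
sweetness_labels <- c("Dry", "Off Dry", "Medium Dry",
                      "Medium Sweet", "Sweet", "Luscious")

# First ten residual.sugar values shown in the str() output.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# cut() assigns each value to the interval it falls in and returns a factor.
sweetness <- cut(residual_sugar,
                 breaks = sweetness_breaks,
                 labels = sweetness_labels,
                 include.lowest = TRUE)

# Count wines per category; categories with no wines show a count of 0.
sweetness_counts <- table(sweetness)
print(sweetness_counts)
```

Because `cut()` returns a factor carrying all six levels, empty categories stay visible in the counts, which makes it easy to see that only the first two are populated.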
After reading about the controversial wine additives, sulfites (and the contaminants, sulfates), I was curious to see the levels in this sample. Here’s a look at sulfites/sulfates:
Let’s zoom in…
General Properties
According to Wine Spectator, the ideal pH levels for red wine are between 3.3 and 3.6. I am also interested in seeing if there is any correlation between quality and density or alcohol content. Here is a quick look at these properties…
There are 1599 observations in this dataset. We will be looking at 12 variables, all numeric, except for ‘quality’, which is an integer.
The main features of interest for me in this dataset are the acids, the residual sugar and level of sweetness, sulfites and pH, density and alcohol percentage. I’m curious about how and if these variables influence quality.
Including the quality variable in the analysis with other variables will be illuminating, and I’m also hoping to discover if the degree of sweetness has any effect on the perceived quality of the wine.
I created a “sweetness” variable based on the levels of residual sugar.
Other than creating a variable for sweetness, I did not perform any operations to tidy, adjust, or change the form of the data.
I’m interested in seeing if any of the variables seem to affect the quality variable. I’m also curious to see if sweetness has any correlation with quality.
First let’s look at the acid variables…
## # A tibble: 6 x 4
## quality fixed_acid_mean fixed_acid_median n
## <int> <dbl> <dbl> <int>
## 1 3 8.360000 7.50 10
## 2 4 7.779245 7.50 53
## 3 5 8.167254 7.80 681
## 4 6 8.347179 7.90 638
## 5 7 8.872362 8.80 199
## 6 8 8.566667 8.25 18
##
## Pearson's product-moment correlation
##
## data: wine$fixed.acidity and wine$quality
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
The correlation between quality and fixed acidity appears weak (r = 0.124).
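The per-quality tables in this section each report a mean, a median and a count for one variable. The tibble printout suggests a dplyr pipeline produced them; the sketch below uses base R instead so that it runs without extra packages, and it works on only the first ten rows shown in the `str()` output, so its numbers will not match the full-data table.

```r
# First ten rows of two columns, copied from the str() output above.
first_ten <- data.frame(
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L),
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
)

# Split the acidity values by quality score, then summarise each group.
groups <- split(first_ten$fixed.acidity, first_ten$quality)
by_quality <- data.frame(
  quality           = as.integer(names(groups)),
  fixed_acid_mean   = vapply(groups, mean, numeric(1), USE.NAMES = FALSE),
  fixed_acid_median = vapply(groups, median, numeric(1), USE.NAMES = FALSE),
  n                 = vapply(groups, length, integer(1), USE.NAMES = FALSE)
)
print(by_quality)
```

Reporting `n` alongside each mean matters here: the full-data table shows only 10 wines at quality 3 and 18 at quality 8, so the means at the ends of the scale are much noisier than those in the middle.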
## # A tibble: 6 x 4
## quality volatile_acid_mean volatile_acid_median n
## <int> <dbl> <dbl> <int>
## 1 3 0.8845000 0.845 10
## 2 4 0.6939623 0.670 53
## 3 5 0.5770411 0.580 681
## 4 6 0.4974843 0.490 638
## 5 7 0.4039196 0.370 199
## 6 8 0.4233333 0.370 18
##
## Pearson's product-moment correlation
##
## data: wine$volatile.acidity and wine$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
There is a moderately negative relationship between quality and volatile acidity.
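To make the test output easier to read, here is a short worked example of how the t statistic and the 95 percent confidence interval follow from the correlation coefficient and the sample size alone. It uses the standard formulas (a t statistic with n - 2 degrees of freedom and a Fisher z-transformation for the interval); the variable names are just for illustration.

```r
# Values reported in the test output above.
r  <- -0.3905578   # sample correlation
df <- 1597         # degrees of freedom, n - 2
n  <- df + 2       # 1599 observations

# t statistic for the null hypothesis that the true correlation is 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation.
z     <- atanh(r)
se    <- 1 / sqrt(n - 3)
ci_95 <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(t_stat, 3))
print(round(ci_95, 4))
```

The results agree with the reported t of -16.954 and interval of about -0.431 to -0.348. With 1,599 observations, even small correlations produce tiny p-values, so the size of the coefficient is a better guide to the strength of a relationship than the p-value.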
## # A tibble: 6 x 4
## quality citric_acid_mean citric_acid_median n
## <int> <dbl> <dbl> <int>
## 1 3 0.1710000 0.035 10
## 2 4 0.1741509 0.090 53
## 3 5 0.2436858 0.230 681
## 4 6 0.2738245 0.260 638
## 5 7 0.3751759 0.400 199
## 6 8 0.3911111 0.420 18
##
## Pearson's product-moment correlation
##
## data: wine$citric.acid and wine$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
There is a positive (but relatively weak) relationship between quality and citric acid.
## # A tibble: 6 x 4
## quality residual_sugar_mean residual_sugar_median n
## <int> <dbl> <dbl> <int>
## 1 3 2.635000 2.1 10
## 2 4 2.694340 2.1 53
## 3 5 2.528855 2.2 681
## 4 6 2.477194 2.2 638
## 5 7 2.720603 2.3 199
## 6 8 2.577778 2.1 18
##
## Pearson's product-moment correlation
##
## data: wine$residual.sugar and wine$quality
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03531327 0.06271056
## sample estimates:
## cor
## 0.01373164
There are a lot of outliers in the 5, 6 and 7 quality range, but there is essentially no correlation between quality and residual sugar (r = 0.014, p = 0.58).
## # A tibble: 6 x 4
## quality chlorides_mean chlorides_median n
## <int> <dbl> <dbl> <int>
## 1 3 0.12250000 0.0905 10
## 2 4 0.09067925 0.0800 53
## 3 5 0.09273568 0.0810 681
## 4 6 0.08495611 0.0780 638
## 5 7 0.07658794 0.0730 199
## 6 8 0.06844444 0.0705 18
##
## Pearson's product-moment correlation
##
## data: wine$chlorides and wine$quality
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17681041 -0.08039344
## sample estimates:
## cor
## -0.1289066
There are a lot of outliers in the 5 and 6 quality range, but generally there is no strong correlation between chlorides and quality.
## # A tibble: 6 x 4
## quality free_sulfur_mean free_sulfur_median n
## <int> <dbl> <dbl> <int>
## 1 3 11.00000 6.0 10
## 2 4 12.26415 11.0 53
## 3 5 16.98385 15.0 681
## 4 6 15.71160 14.0 638
## 5 7 14.04523 11.0 199
## 6 8 13.27778 7.5 18
##
## Pearson's product-moment correlation
##
## data: wine$free.sulfur.dioxide and wine$quality
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.099430290 -0.001638987
## sample estimates:
## cor
## -0.05065606
There is no strong correlation between free sulfur dioxide and quality.
## # A tibble: 6 x 4
## quality total_sulfur_mean total_sulfur_median n
## <int> <dbl> <dbl> <int>
## 1 3 24.90000 15.0 10
## 2 4 36.24528 26.0 53
## 3 5 56.51395 47.0 681
## 4 6 40.86991 35.0 638
## 5 7 35.02010 27.0 199
## 6 8 33.44444 21.5 18
##
## Pearson's product-moment correlation
##
## data: wine$total.sulfur.dioxide and wine$quality
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2320162 -0.1373252
## sample estimates:
## cor
## -0.1851003
There is no strong correlation between quality and total sulfur dioxide.
## # A tibble: 6 x 4
## quality sulphates_mean sulphates_median n
## <int> <dbl> <dbl> <int>
## 1 3 0.5700000 0.545 10
## 2 4 0.5964151 0.560 53
## 3 5 0.6209692 0.580 681
## 4 6 0.6753292 0.640 638
## 5 7 0.7412563 0.740 199
## 6 8 0.7677778 0.740 18
##
## Pearson's product-moment correlation
##
## data: wine$sulphates and wine$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
Sulfate levels in this sample are slightly higher in the quality levels above 5. There is a correlation (though relatively weak) between sulfates and quality.
## # A tibble: 6 x 4
## quality density_mean density_median n
## <int> <dbl> <dbl> <int>
## 1 3 0.9974640 0.997565 10
## 2 4 0.9965425 0.996500 53
## 3 5 0.9971036 0.997000 681
## 4 6 0.9966151 0.996560 638
## 5 7 0.9961043 0.995770 199
## 6 8 0.9952122 0.994940 18
##
## Pearson's product-moment correlation
##
## data: wine$density and wine$quality
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2220365 -0.1269870
## sample estimates:
## cor
## -0.1749192
There is no strong correlation between density and quality.
## # A tibble: 6 x 4
## quality pH_mean pH_median n
## <int> <dbl> <dbl> <int>
## 1 3 3.398000 3.39 10
## 2 4 3.381509 3.37 53
## 3 5 3.304949 3.30 681
## 4 6 3.318072 3.32 638
## 5 7 3.290754 3.28 199
## 6 8 3.267222 3.23 18
##
## Pearson's product-moment correlation
##
## data: wine$pH and wine$quality
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.106451268 -0.008734972
## sample estimates:
## cor
## -0.05773139
Interestingly, according to Wine Spectator, the best pH for red wines is between 3.3 and 3.6. Both the mean and the median pH fall within that range for every quality level except the two best levels represented in this dataset, 7 and 8, where they sit slightly below it. However, the correlation between pH and quality is very weak (-0.058).
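The range comparison can be checked directly against the table above. This sketch copies the per-quality pH means and medians from the output and flags which ones fall inside the 3.3 to 3.6 range; the helper name `in_range` is made up for illustration.

```r
# Per-quality pH means and medians, copied from the table above.
ph_by_quality <- data.frame(
  quality   = 3:8,
  pH_mean   = c(3.398000, 3.381509, 3.304949, 3.318072, 3.290754, 3.267222),
  pH_median = c(3.39, 3.37, 3.30, 3.32, 3.28, 3.23)
)

# Ideal range for red wine as cited in the text.
ideal_low  <- 3.3
ideal_high <- 3.6

# Hypothetical helper: TRUE where a value lies inside the ideal range.
in_range <- function(x) x >= ideal_low & x <= ideal_high

ph_by_quality$mean_in_range   <- in_range(ph_by_quality$pH_mean)
ph_by_quality$median_in_range <- in_range(ph_by_quality$pH_median)
print(ph_by_quality)
```

Only quality levels 7 and 8 are flagged as out of range, and the quality 5 values sit right at the lower edge, so the differences involved are small.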
## # A tibble: 6 x 4
## quality alcohol_mean alcohol_median n
## <int> <dbl> <dbl> <int>
## 1 3 9.955000 9.925 10
## 2 4 10.265094 10.000 53
## 3 5 9.899706 9.700 681
## 4 6 10.629519 10.500 638
## 5 7 11.465913 11.500 199
## 6 8 12.094444 12.150 18
##
## Pearson's product-moment correlation
##
## data: wine$alcohol and wine$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
There seems to be a substantial increase in alcohol levels as quality increases, and a fairly strong correlation between alcohol and quality.
Examining sweetness
First I plotted the proportion of each quality level in each level of sweetness:
But it was difficult for me to tell whether ‘dry’ or ‘off dry’ wines were of better quality. So I ran a summary of the mean and median quality levels of both:
## # A tibble: 2 x 4
## sweetness quality_mean quality_median n
## <fctr> <dbl> <dbl> <int>
## 1 Dry 5.631683 6 1515
## 2 Off Dry 5.714286 6 84
Both ‘dry’ and ‘off dry’ wines are very close in average quality (means of 5.63 and 5.71, both with a median of 6), so sweetness level does not appear to matter for quality in this dataset. This shouldn’t be too surprising after seeing how low the correlation coefficient is between quality and residual sugar.
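As a quick arithmetic check on the summary above, the two group means weighted by their counts should reproduce the overall mean quality of 5.636 reported in the summary output. The sketch also computes the gap between the groups and the share of dry wines; the variable names are illustrative.

```r
# Group means and counts, copied from the sweetness summary above.
sweetness_summary <- data.frame(
  sweetness    = c("Dry", "Off Dry"),
  quality_mean = c(5.631683, 5.714286),
  n            = c(1515, 84)
)

# Weighted mean of the group means should equal the overall mean quality.
overall_mean <- weighted.mean(sweetness_summary$quality_mean,
                              sweetness_summary$n)

# Difference between the off-dry and dry group means.
mean_gap <- diff(sweetness_summary$quality_mean)

# Proportion of wines labelled dry.
share_dry <- sweetness_summary$n[1] / sum(sweetness_summary$n)

print(round(c(overall_mean = overall_mean,
              mean_gap     = mean_gap,
              share_dry    = share_dry), 3))
```

The gap of under 0.1 points is small on a scale that runs from 3 to 8 in this sample, and with only 84 off-dry wines the comparison rests on a fairly small group.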
Alcohol and density
I was reminded while doing a little research on wine components that alcohol is less dense than water. So, I wanted to plot alcohol and density, expecting to see density decrease as alcohol level increased.
##
## Pearson's product-moment correlation
##
## data: wine$alcohol and wine$density
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5322547 -0.4583061
## sample estimates:
## cor
## -0.4961798
Sure enough, as alcohol level increases, density decreases. There is a fairly strong negative correlation between alcohol and density (-0.496).
My main points of note in this investigation were as follows. There is a moderate negative relationship between quality and volatile acidity. There is a relatively weak positive relationship between quality and citric acid, and another between quality and sulphates. The best pH for red wines is reportedly between 3.3 and 3.6; the mean and median pH fall within that range for every quality level except the two best levels represented in this dataset, 7 and 8, where both sit slightly below it. However, the correlation between pH and quality is very weak (-0.058). Alcohol levels rise substantially as quality increases, and the correlation between alcohol and quality is fairly strong. As alcohol level increases, density decreases, and there is a fairly strong negative correlation between the two.
I was hoping to find an interesting correlation between sweetness and quality, but found none. I also examined alcohol and density, and found a fairly strong relationship between the two.
The strongest relationships I found were between alcohol and density (with a correlation coefficient of -0.496) and alcohol and quality (with a correlation coefficient of 0.476).
Now that I’ve found which variables have the more positive relationship with quality (alcohol, sulphates and citric acid), I’m interested in seeing how those variables relate to each other, and their combined effect on quality.
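The coefficients reported against quality in the earlier sections can be collected into one vector and ranked by absolute size, which makes the comparison easier to see at a glance. The values below are copied from the test outputs above; the object names are illustrative.

```r
# Pearson correlations with quality, copied from the outputs above.
quality_cor <- c(
  fixed.acidity        =  0.1240516,
  volatile.acidity     = -0.3905578,
  citric.acid          =  0.2263725,
  residual.sugar       =  0.01373164,
  chlorides            = -0.1289066,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.1851003,
  sulphates            =  0.2513971,
  density              = -0.1749192,
  pH                   = -0.05773139,
  alcohol              =  0.4761663
)

# Rank by strength of relationship, ignoring the sign.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
print(round(ranked, 3))

# Squared correlation: share of variance in quality that a straight-line
# fit on each variable alone would account for.
print(round(ranked^2, 3))
```

Alcohol and volatile acidity lead the ranking, followed by sulphates and citric acid. Even the strongest of these, alcohol, accounts for less than a quarter of the variance in quality on its own.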
Alcohol, Citric Acid and Quality
##
## Pearson's product-moment correlation
##
## data: wine$alcohol and wine$citric.acid
## t = 4.4188, df = 1597, p-value = 1.059e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.06121189 0.15807276
## sample estimates:
## cor
## 0.1099032
This plot shows that there are many instances of higher quality wines that have higher citric acid or alcohol, independent of the other variable. The correlation between alcohol and citric acid proves to be weak (correlation coefficient is .110).
Sulphates, Citric Acid and Quality
##
## Pearson's product-moment correlation
##
## data: wine$sulphates and wine$citric.acid
## t = 13.159, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2678558 0.3563278
## sample estimates:
## cor
## 0.31277
It seems as if there are many higher quality wines that are high in both citric acid and sulphates. Interestingly, there is a moderately positive relationship between citric acid and sulphates (correlation coefficient .313).
##
## Pearson's product-moment correlation
##
## data: wine$alcohol and wine$sulphates
## t = 3.7568, df = 1597, p-value = 0.0001783
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.04477906 0.14196454
## sample estimates:
## cor
## 0.09359475
It seems as if the higher quality wines are fairly spread out in this plot. There is a weak correlation between alcohol and sulphates (correlation coefficient is .094).
The two variables that I examined here that had the strongest relationship were citric acid and sulphates, while the weakest were sulphates and alcohol.
Although higher quality wines tend to have higher levels of alcohol and citric acid, there is a weaker relationship between alcohol and citric acid than I was expecting.
Although I was hoping to find an interesting relationship between sweetness and quality, this plot represents one of the weakest correlations that I discovered between two variables.
I chose this plot because it showed a surprisingly strong correlation between sulphates and citric acid. This relationship might warrant further investigation.
I learned a great deal through the process of completing this project. I was interested in doing some background research into the chemical components of wine, of which I knew very little. It was exciting to learn that based on the level of residual sugar in the wines, I could create a new categorical variable that labeled the wines according to how dry or sweet the wines were. It was disappointing to find out that the vast majority of wines were dry, and that the degree of sweetness had no significant correlation to the quality of the wine.
However, it was interesting to see that levels of alcohol, citric acid and sulphates did have a positive relationship with quality, and to also confirm that alcohol and density have a fairly strong negative correlation with each other. It was surprising to find that sulphates and citric acid had a moderately positive correlation.
I enjoyed this project because it showed me how doing EDA can lead to better and more informed questions about your dataset. I found some interesting relationships but know I would have to dig deeper in order to draw any conclusions.
In the future, it would be interesting to amend the dataset to include a wider sample and a wider range of values for variables such as residual sugar, as well as other factors such as where the grapes used to make the wine were from, what year the wine was created, etc. I think that would help in drawing broader conclusions about which variables influence quality.